Searching for web information more efficiently using presentational layout analysis

Authors

  • Milos Kovacevic
  • Michelangelo Diligenti
  • Marco Gori
  • Veljko M. Milutinovic
Abstract

Extracting and processing information from web pages is an important task in many areas, such as building search engines, information retrieval, and web data mining. A common approach to extraction is to represent a page as a ‘bag of words’ and then perform additional processing on that flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object on a page. Using this visual information, one can define heuristics for recognising common page areas such as the header, left and right menus, footer, and centre of a page. Initial experiments have shown that, using our heuristics, the defined areas are recognised properly in 73% of cases. Finally, we introduce a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.
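To make the idea of coordinate-based heuristics concrete, here is a minimal sketch of how an HTML object's rendered bounding box might be mapped to one of the page areas named in the abstract. The `Box` type, the `classify_region` function, and the `band` and `side` thresholds are all hypothetical illustrations; the abstract does not state the rules or values the authors actually use.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of an HTML object, in pixels (hypothetical type)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_region(box: Box, page_width: float, page_height: float,
                    band: float = 0.2, side: float = 0.3) -> str:
    """Assign a box to a page area using only its screen coordinates.

    `band` and `side` are illustrative fractions of the page height and
    width; they are assumptions, not values taken from the paper.
    """
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    if bottom <= band * page_height:        # lies entirely in the top band
        return "header"
    if top >= (1 - band) * page_height:     # lies entirely in the bottom band
        return "footer"
    if right <= side * page_width:          # lies entirely in the left strip
        return "left menu"
    if left >= (1 - side) * page_width:     # lies entirely in the right strip
        return "right menu"
    return "centre"


if __name__ == "__main__":
    # A narrow column on the left of a 1000 x 2000 pixel page.
    print(classify_region(Box(0, 500, 200, 800), 1000, 2000))
```

Checking the header and footer bands before the side strips is a design choice of this sketch: a full-width bar at the top of the page is treated as a header even though it also overlaps the left strip.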


Similar resources

A Novel Indexing Technique for Web Documents using Hierarchical Clustering

The information on the WWW is growing at an exponential rate; therefore, search engines are required to index the downloaded Web documents more efficiently. Web mining techniques like clustering can be used for this purpose. In this paper, a novel technique to index the documents is proposed that not only indexes the documents more efficiently but also uses hierarchical clustering to keep...


Engineering the Presentation Layer of Adaptable Web Information Systems

Engineering adaptable Web Information Systems (WIS) requires systematic design models and specification frameworks. A complete model-driven methodology like Hera distinguishes between the conceptual, navigational, and presentational aspects of WIS design and identifies different adaptation “hot-spots” in each design step. This paper concentrates on adaptation in the presentation layer and combi...


Application of Semantic Web Ontologies in Medical Information Systems

One of the challenges of current medical information systems, which are based on keyword searching, is that they may retrieve a large amount of irrelevant information during searching. Also, these systems don't provide interoperability among healthcare systems. To address these challenges, and to improve interoperability between user and machine, semantic web (web 3) has been d...


A Web Smart Space Framework for Information Mining: A base for Intelligent Search Engines

A web smart space is an intelligent environment which has the additional capability of searching for information smartly and efficiently. New advancements like dynamic web content generation have increased the size of web repositories. Among the many modern software analysis requirements, one is to search for information in a given repository. But useful information extraction is a troublesome hitch...


A Web Smart Space Framework for Intelligent Search Engines

A web smart space is an intelligent environment which has the additional capability of searching for information smartly and efficiently. New advancements like dynamic web content generation have increased the size of web repositories. Among the many modern software analysis requirements, one is to search for information in a given repository. But useful information extraction is a troublesome hitch...



Journal title:
  • IJEB

Volume 1

Publication date: 2003